[KYUUBI #7028] Persist the kubernetes application terminate state into metastore for app info store fallback #7029

Closed
wants to merge 8 commits into from

turboFei
Member

@turboFei turboFei commented Apr 16, 2025

Why are the changes needed?

  1. Persist the Kubernetes application termination info into the metastore to prevent event loss.
  2. If the application info cannot be found in the informer application info store, fall back to reading it from the metastore instead of returning NOT_FOUND directly.
  3. This is critical because returning an incorrect application state might cause data quality issues.
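
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `InformerStore`, `Metastore`, `AppInfo`, `onPodEvent`, and `getApplicationInfo` are hypothetical stand-ins, and both stores are modeled as in-memory maps.

```scala
object AppInfoFallbackSketch {
  // Hypothetical, simplified types; the real Kyuubi classes differ.
  sealed trait AppState
  case object Running extends AppState
  case object Finished extends AppState
  case object Failed extends AppState
  case object NotFound extends AppState

  final case class AppInfo(tag: String, state: AppState)

  // Stand-in for the in-memory store fed by the Kubernetes informer.
  // Entries can disappear (missed events, expiry, restart).
  final class InformerStore {
    private val infos = scala.collection.mutable.Map.empty[String, AppInfo]
    def put(info: AppInfo): Unit = infos.update(info.tag, info)
    def get(tag: String): Option[AppInfo] = infos.get(tag)
    def remove(tag: String): Unit = { infos.remove(tag); () }
  }

  // Stand-in for the metastore table that keeps terminal states.
  final class Metastore {
    private val rows = scala.collection.mutable.Map.empty[String, AppInfo]
    def upsert(info: AppInfo): Unit = rows.update(info.tag, info)
    def get(tag: String): Option[AppInfo] = rows.get(tag)
  }

  def isTerminal(state: AppState): Boolean = state == Finished || state == Failed

  // On every event, update the in-memory store; persist only terminal states.
  def onPodEvent(store: InformerStore, metastore: Metastore, info: AppInfo): Unit = {
    store.put(info)
    if (isTerminal(info.state)) metastore.upsert(info)
  }

  // Informer store first, metastore second, NOT_FOUND only as the last resort.
  def getApplicationInfo(store: InformerStore, metastore: Metastore, tag: String): AppInfo =
    store.get(tag).orElse(metastore.get(tag)).getOrElse(AppInfo(tag, NotFound))
}
```

With this ordering, an application whose entry has already dropped out of the informer store still resolves to its persisted terminal state rather than NOT_FOUND.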

How was this patch tested?

UT and IT.

image

Was this patch authored or co-authored using generative AI tooling?

No.

@turboFei turboFei changed the title Kubernetes state [KYUUBI #7028] Persist the kubernetes application state into metastore Apr 16, 2025
@turboFei turboFei changed the title [KYUUBI #7028] Persist the kubernetes application state into metastore [KYUUBI #7028] Persist the kubernetes application terminate state into metastore Apr 16, 2025
@turboFei turboFei marked this pull request as draft April 16, 2025 06:34
@turboFei turboFei self-assigned this Apr 16, 2025
@turboFei turboFei force-pushed the kubernetes_state branch 2 times, most recently from 3feec05 to d41eea6 Compare April 16, 2025 07:13
@turboFei turboFei marked this pull request as ready for review April 16, 2025 07:13
@turboFei turboFei changed the title [KYUUBI #7028] Persist the kubernetes application terminate state into metastore [KYUUBI #7028] Persist the kubernetes application terminate state into metastore for kubernetes app info fallback Apr 16, 2025
@turboFei turboFei changed the title [KYUUBI #7028] Persist the kubernetes application terminate state into metastore for kubernetes app info fallback [KYUUBI #7028] Persist the kubernetes application terminate state into metastore for app info store fallback Apr 16, 2025
@codecov-commenter
codecov-commenter commented Apr 16, 2025

Codecov Report

Attention: Patch coverage is 0% with 145 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (29b6076) to head (9f2bade).
Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...yuubi/server/metadata/jdbc/JDBCMetadataStore.scala | 0.00% | 75 Missing ⚠️ |
| ...kyuubi/engine/KubernetesApplicationOperation.scala | 0.00% | 22 Missing ⚠️ |
| ...ubi/server/metadata/jdbc/JdbcDatabaseDialect.scala | 0.00% | 19 Missing ⚠️ |
| ...pache/kyuubi/server/metadata/MetadataManager.scala | 0.00% | 17 Missing ⚠️ |
| ...ubi/server/metadata/api/KubernetesEngineInfo.scala | 0.00% | 12 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #7029    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files         695     696     +1     
  Lines       42833   42977   +144     
  Branches     5833    5839     +6     
=======================================
- Misses      42833   42977   +144     

@turboFei
Member Author

image

Testing passed, cc @pan3793

@turboFei turboFei force-pushed the kubernetes_state branch 2 times, most recently from 5293f09 to fc437aa Compare April 17, 2025 02:55
@turboFei turboFei requested a review from pan3793 April 17, 2025 03:22
@turboFei
Member Author

This PR has been well tested. cc @pan3793

Member

@pan3793 pan3793 left a comment

DB schema LGTM; the upsert implementation has room for improvement.

migrate

upsert

app

app name

Add app name column

comments
|INSERT INTO $table (${colsToInsert.mkString(",")})
|VALUES (${colsToInsert.map(_ => "?").mkString(",")})
|ON CONFLICT ($keyCol)
|DO UPDATE SET
|${colsToReplace.map(c => s"$c = EXCLUDED.$c").mkString(",")}
Member Author

|INSERT INTO $table (${colsToInsert.mkString(",")})
|VALUES (${colsToInsert.map(_ => "?").mkString(",")})
|ON DUPLICATE KEY UPDATE
|${colsToReplace.map(c => s"$c = VALUES($c)").mkString(",")}
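
For reference, the two upsert shapes quoted in this thread can be generated from one helper, roughly as sketched below. This is only an illustration and not necessarily what was merged: `upsertSql`, the `Dialect` objects, and the table and column names in the usage are hypothetical. For MySQL it uses the row-alias form (`AS new`, available since MySQL 8.0.19) as one documented way to avoid the deprecated `VALUES(col)` function mentioned in commit 12c24b1.

```scala
object UpsertSqlSketch {
  // Hypothetical dialect markers for illustration only.
  sealed trait Dialect
  case object MySQL extends Dialect
  case object PostgreSQL extends Dialect

  // Builds a parameterized upsert statement: insert the row, or update the
  // non-key columns when a row with the same key already exists.
  def upsertSql(dialect: Dialect, table: String, keyCol: String, colsToReplace: Seq[String]): String = {
    val colsToInsert = keyCol +: colsToReplace
    val placeholders = colsToInsert.map(_ => "?").mkString(",")
    val insert = s"INSERT INTO $table (${colsToInsert.mkString(",")}) VALUES ($placeholders)"
    dialect match {
      case PostgreSQL =>
        // EXCLUDED refers to the row that was proposed for insertion.
        val updates = colsToReplace.map(c => s"$c = EXCLUDED.$c").mkString(",")
        s"$insert ON CONFLICT ($keyCol) DO UPDATE SET $updates"
      case MySQL =>
        // Row alias instead of the deprecated VALUES(col) function.
        val updates = colsToReplace.map(c => s"$c = new.$c").mkString(",")
        s"$insert AS new ON DUPLICATE KEY UPDATE $updates"
    }
  }
}
```

Calling `UpsertSqlSketch.upsertSql(UpsertSqlSketch.PostgreSQL, "k8s_engine_info", "identifier", Seq("app_state"))` (names assumed) yields a single statement that can be bound through a JDBC `PreparedStatement`.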
Member Author

@turboFei turboFei requested a review from pan3793 April 25, 2025 17:55
Member

@pan3793 pan3793 left a comment

LGTM, except for the dialect API style

@turboFei turboFei closed this in 02a6b13 Apr 27, 2025
turboFei added a commit that referenced this pull request Apr 27, 2025
…o metastore for app info store fallback

### Why are the changes needed?

1. Persist the Kubernetes application termination info into the metastore to prevent event loss.
2. If the application info cannot be found in the informer application info store, fall back to reading it from the metastore instead of returning NOT_FOUND directly.
3. This is critical because returning an incorrect application state might cause data quality issues.

### How was this patch tested?

UT and IT.

<img width="1917" alt="image" src="https://github.com/user-attachments/assets/306f417c-5037-4869-904d-dcf657ff8f60" />

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #7029 from turboFei/kubernetes_state.

Closes #7028

9f2bade [Wang, Fei] generic dialect
186cc69 [Wang, Fei] nit
82ea626 [Wang, Fei] Add pod name
4c59beb [Wang, Fei] Refine
327a0d5 [Wang, Fei] Remove create_time from k8s engine info
12c24b1 [Wang, Fei] do not use MYSQL deprecated VALUES(col)
becf9d1 [Wang, Fei] insert or replace
d167623 [Wang, Fei] migration

Authored-by: Wang, Fei <[email protected]>
Signed-off-by: Wang, Fei <[email protected]>
(cherry picked from commit 02a6b13)
Signed-off-by: Wang, Fei <[email protected]>
@turboFei turboFei deleted the kubernetes_state branch April 27, 2025 08:37
@turboFei
Member Author

thanks, merged to 1.11.0

@turboFei turboFei added this to the v1.11.0 milestone Apr 27, 2025
pan3793 added a commit that referenced this pull request Apr 27, 2025
…to prevent data quality issue

### Why are the changes needed?

Currently, the NOT_FOUND application state is treated as a terminated but not failed state.

This might cause data quality issues if downstream applications depend on the batch state for data processing.

So, I think we should treat NOT_FOUND as a failed state instead.

Currently, we support 3 types of application managers.
1. [JpsApplicationOperation](https://github.com/apache/kyuubi/blob/master/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/JpsApplicationOperation.scala)
2. [YarnApplicationOperation](https://github.com/apache/kyuubi/blob/master/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/YarnApplicationOperation.scala)
3. [KubernetesApplicationOperation](https://github.com/apache/kyuubi/blob/master/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/KubernetesApplicationOperation.scala)

YarnApplicationOperation and KubernetesApplicationOperation are widely used in production.

And in multi-instance Kyuubi mode, the NOT_FOUND case should rarely happen.
1.  https://github.com/apache/kyuubi/blob/7e199d6fdbdf52222bb3eadd056b9e5a2295f36e/kyuubi-server/src/main/scala/org/apache/kyuubi/server/api/v1/BatchesResource.scala#L369-L385

3. #7029

So, I think we should treat NOT_FOUND as a failed state in production use cases.
It is better to fail some corner cases than to mistakenly mark unsuccessful batches as finished.
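
The proposed classification can be illustrated with a small sketch. The enumeration and helper names below are simplified stand-ins chosen for illustration and may not match the actual definitions in `ApplicationOperation.scala`.

```scala
object NotFoundAsFailedSketch {
  // Simplified stand-in for the application state enumeration (assumed names).
  object ApplicationState extends Enumeration {
    val PENDING, RUNNING, FINISHED, KILLED, FAILED, NOT_FOUND, UNKNOWN = Value
  }
  import ApplicationState._

  // NOT_FOUND still counts as terminated: there is nothing left to wait for.
  def isTerminated(state: ApplicationState.Value): Boolean =
    Set(FINISHED, KILLED, FAILED, NOT_FOUND).contains(state)

  // The change argued for above: NOT_FOUND is also reported as failed,
  // so a batch whose application cannot be located is never marked finished.
  def isFailed(state: ApplicationState.Value): Boolean =
    Set(FAILED, KILLED, NOT_FOUND).contains(state)
}
```

Under this sketch, a downstream job that gates on "terminated and not failed" proceeds only for FINISHED, never for NOT_FOUND.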

### How was this patch tested?

GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #7033 from turboFei/revist_not_found.

Closes #7033

ada4f88 [Cheng Pan] Update kyuubi-server/src/main/scala/org/apache/kyuubi/engine/ApplicationOperation.scala
985e23c [Wang, Fei] Refine
f03d612 [Wang, Fei] comments
b9d6ac2 [Wang, Fei] incase the metadata updated by peer instance
3bd61ca [Wang, Fei] add
339df47 [Wang, Fei] treat NOT_FOUND as failed

Lead-authored-by: Wang, Fei <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>